Introduction¶

**Team Members**: Christian Peralta, Camille Beatrice Valera, Sebastian Urs Wey

For the Machine Learning Kaggle Competition 2022, we were given a dataset derived from the United States Census Bureau (USCB) https://www.census.gov/. The USCB conducts various yearly surveys as well as the decennial census, which produce data about the U.S. population and its economy. These data enable federal and local governments to make informed decisions regarding the allocation of federal funds, international trade, health, housing, and other factors that influence the standard of living.

Based on the demographic predictors in the USCB dataset, our main objective is to model and predict the income level of U.S. residents participating in the 2020-2021 survey. In the subsequent sections, we outline the specific approaches and thought process behind the following:

  • [Section 1] Exploratory Data Analysis
  • [Section 2] Model Preparation and Selection
  • [Section 3] Model Evaluation
  • [Section 4] Conclusion and Lessons Learned

Supplementary notebooks have also been included to demonstrate other methods tested and further justify the final model selected for submission.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import timeit
from sklearn.preprocessing import OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.model_selection import ParameterGrid, StratifiedKFold, GridSearchCV, cross_val_score, cross_val_predict
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.metrics import accuracy_score
from yellowbrick.classifier import ROCAUC, confusion_matrix
In [ ]:
# Importing the datasets

test_df = pd.read_csv('test.csv')
train_df = pd.read_csv('train.csv')

[1] Exploratory Data Analysis¶

In the training data set (train_df), we have 41 columns, including the response variable high_income. There are a total of 54607 entries, with many features containing NAs. The fraction of NAs differs across features and exceeds 50% for some variables (e.g. mig_prev_sunbelt).
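These NA fractions can be computed directly with `isna().mean()`; a minimal sketch, using a small toy frame in place of train_df so it runs standalone:

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_df: two columns with different NA fractions
df = pd.DataFrame({
    "age": [42.0, np.nan, 26.0, np.nan],
    "mig_prev_sunbelt": [np.nan, np.nan, np.nan, "No"],
})

# Fraction of missing values per column, highest first
na_frac = df.isna().mean().sort_values(ascending=False)
print(na_frac)
```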

In [ ]:
train_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54607 entries, 0 to 54606
Data columns (total 41 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   id                54607 non-null  int64  
 1   occ_code_level2   54607 non-null  int64  
 2   age               29892 non-null  float64
 3   stock_dividends   29720 non-null  float64
 4   mig_chg_msa       26888 non-null  object 
 5   tax_filer_stat    29121 non-null  object 
 6   det_hh_summ       54607 non-null  object 
 7   mig_prev_sunbelt  26888 non-null  object 
 8   hisp_origin       54377 non-null  object 
 9   education         29786 non-null  object 
 10  wage_per_hour     29712 non-null  float64
 11  capital_losses    29850 non-null  float64
 12  vet_question      54607 non-null  object 
 13  own_or_self       54607 non-null  int64  
 14  country_self      53532 non-null  object 
 15  mig_move_reg      26888 non-null  object 
 16  high_income       54607 non-null  int64  
 17  hs_college        29771 non-null  object 
 18  class_worker      54607 non-null  object 
 19  mig_same          54607 non-null  object 
 20  unemp_reason      29642 non-null  object 
 21  state_prev_res    54423 non-null  object 
 22  ind_code_level2   54607 non-null  int64  
 23  race              29759 non-null  object 
 24  country_mother    52818 non-null  object 
 25  capital_gains     29716 non-null  float64
 26  sex               30073 non-null  object 
 27  ind_code_level1   54607 non-null  object 
 28  citizenship       54607 non-null  object 
 29  union_member      54607 non-null  object 
 30  fam_under_18      54607 non-null  object 
 31  marital_stat      29126 non-null  object 
 32  region_prev_res   54607 non-null  object 
 33  mig_chg_reg       26888 non-null  object 
 34  country_father    52581 non-null  object 
 35  occ_code_level1   54607 non-null  object 
 36  full_or_part_emp  54607 non-null  object 
 37  weeks_worked      30410 non-null  float64
 38  det_hh_fam_stat   54607 non-null  object 
 39  num_emp           54607 non-null  int64  
 40  vet_benefits      54607 non-null  int64  
dtypes: float64(6), int64(7), object(28)
memory usage: 17.1+ MB

We further observe that the first feature id is redundant, as we assume that the data is already well shuffled. In addition to NaN (which represents NAs), we often observe an entry called Not in universe. This entry should not be confused with NA, since it indicates that the person completing the questionnaire is not concerned by the question. Another notable observation is that some categorical variables contain a high number of distinct categories, e.g. the variable education or the variable country_father.
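The distinction matters when counting: value_counts(dropna=False) treats Not in universe as an ordinary category, while NaN stays missing. A minimal sketch on toy data:

```python
import numpy as np
import pandas as pd

# Toy column: "Not in universe" is a regular category,
# while np.nan marks a genuinely missing answer
col = pd.Series(["Not in universe", "High school", np.nan, "Not in universe"])

print(col.value_counts(dropna=False))  # NaN counted separately from the categories
print(col.nunique())                   # number of distinct categories (NaN excluded)
```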

In [ ]:
train_df.head(10)
Out[ ]:
id occ_code_level2 age stock_dividends mig_chg_msa tax_filer_stat det_hh_summ mig_prev_sunbelt hisp_origin education ... marital_stat region_prev_res mig_chg_reg country_father occ_code_level1 full_or_part_emp weeks_worked det_hh_fam_stat num_emp vet_benefits
0 1 0 42.0 0.0 NaN Nonfiler Householder NaN All other 11th grade ... NaN Not in universe NaN United-States Not in universe Not in labor force 0.0 Householder 0 2
1 2 18 56.0 NaN NaN NaN Householder NaN All other High school graduate ... Married-civilian spouse present Not in universe NaN United-States Sales Full-time schedules NaN Householder 1 2
2 3 26 26.0 NaN NaN Joint both under 65 Householder NaN All other High school graduate ... NaN Not in universe NaN Haiti Adm support including clerical Full-time schedules NaN Householder 3 2
3 4 0 67.0 NaN MSA to MSA Joint one under 65 & one 65+ Householder No All other NaN ... NaN Northeast Same county United-States Not in universe Children or Armed Forces 0.0 Householder 0 1
4 5 0 NaN NaN Nonmover Nonfiler Child under 18 never married Not in universe All other Children ... NaN Not in universe Nonmover United-States Not in universe Children or Armed Forces NaN Child <18 never marr not in subfamily 0 0
5 6 36 NaN NaN NonMSA to nonMSA NaN Householder No All other NaN ... NaN South Same county United-States Machine operators assmblrs & inspctrs Children or Armed Forces NaN Nonfamily householder 6 2
6 7 26 40.0 NaN Nonmover Single Householder Not in universe All other High school graduate ... Divorced Not in universe Nonmover United-States Adm support including clerical Children or Armed Forces 52.0 Householder 6 2
7 8 11 39.0 250.0 Nonmover Single Householder Not in universe All other NaN ... NaN Not in universe Nonmover United-States Professional specialty Children or Armed Forces 52.0 Nonfamily householder 6 2
8 9 0 51.0 0.0 Nonmover Nonfiler Householder Not in universe Other Spanish 1st 2nd 3rd or 4th grade ... NaN Not in universe Nonmover Puerto-Rico Not in universe Children or Armed Forces 0.0 Householder 0 2
9 10 0 44.0 0.0 Nonmover Head of household Householder Not in universe All other Some college but no degree ... Divorced Not in universe Nonmover United-States Not in universe Children or Armed Forces 26.0 Householder 2 2

10 rows × 41 columns

Overlapping Variables: The following examples show that some variables obviously overlap. For example, ind_code_level1 and ind_code_level2 contain the exact same number of answers for some categories.

In [ ]:
train_df["ind_code_level1"].value_counts().head(7)
Out[ ]:
Not in universe or children          21436
Retail trade                          4363
Manufacturing-durable goods           3531
Education                             2725
Finance insurance and real estate     2535
Manufacturing-nondurable goods        2312
Other professional services           2088
Name: ind_code_level1, dtype: int64
In [ ]:
train_df["ind_code_level2"].value_counts().head(7)
Out[ ]:
0     21436
33     4363
43     2725
45     2088
4      1884
42     1658
37     1411
Name: ind_code_level2, dtype: int64

The same is true for the variables det_hh_fam_stat and det_hh_summ, where det_hh_summ summarizes some categories of det_hh_fam_stat. For instance, the number of householders in det_hh_summ is 26295, which equals the number of householders (19830) plus the number of nonfamily householders (6465) in det_hh_fam_stat.
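This summarization can be checked arithmetically; a minimal sketch on toy data (the category mapping here is assumed for illustration):

```python
import pandas as pd

# Toy example of the det_hh_fam_stat -> det_hh_summ summarization:
# both householder types collapse into a single "Householder" category
fam_stat = pd.Series(["Householder", "Nonfamily householder",
                      "Householder", "Spouse of householder"])
summ = fam_stat.replace({"Nonfamily householder": "Householder"})

n_detailed = (fam_stat == "Householder").sum() + (fam_stat == "Nonfamily householder").sum()
n_summary = (summ == "Householder").sum()
# In train_df the analogous check is 19830 + 6465 == 26295
print(n_detailed, n_summary)
```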

In [ ]:
train_df["det_hh_fam_stat"].value_counts().head(7)
Out[ ]:
Householder                                 19830
Spouse of householder                       10981
Child <18 never marr not in subfamily       10280
Nonfamily householder                        6465
Child 18+ never marr Not in a subfamily      2526
Secondary individual                         1578
Other Rel 18+ ever marr not in subfamily      445
Name: det_hh_fam_stat, dtype: int64
In [ ]:
train_df["det_hh_summ"].value_counts().head(7)
Out[ ]:
Householder                             26295
Spouse of householder                   10982
Child under 18 never married            10299
Child 18 or older                        3043
Other relative of householder            2055
Nonrelative of householder               1893
Group Quarters- Secondary individual       29
Name: det_hh_summ, dtype: int64

The variable hs_college can also be considered redundant, as it contains only a small amount of information that is already included in the variable education.

In [ ]:
train_df["hs_college"].value_counts()
Out[ ]:
Not in universe          28231
High school               1015
College or university      525
Name: hs_college, dtype: int64

Skewed variables: We also discovered that a few variables are heavily skewed: stock_dividends, wage_per_hour, capital_losses, capital_gains
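Skewness can also be quantified numerically with pandas' skew(); a minimal sketch on synthetic data mimicking a zero-inflated column such as capital_gains:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Mostly zeros with a few large positive values, like capital_gains
skewed = pd.Series(np.where(rng.random(1000) < 0.95,
                            0.0, rng.exponential(10000, 1000)))
# Roughly symmetric column for comparison
symmetric = pd.Series(rng.normal(40, 10, 1000))

print(skewed.skew(), symmetric.skew())
```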

In [ ]:
# Capital gains
sns.displot(data=train_df['capital_gains'], kind="hist", kde=True).set(title='Distribution of capital gains');

# Capital losses
sns.displot(data=train_df['capital_losses'], kind="hist", kde=True).set(title="Distribution of capital losses");

# Stock dividends
sns.displot(data=train_df['stock_dividends'], kind="hist", kde=True).set(title="Distribution of stock_dividends");

# Wage per hour
sns.displot(data=train_df['wage_per_hour'], kind="hist", kde=True).set(title="Distribution of wage_per_hour");

1.1 Feature Engineering¶

Once exploratory data analysis was completed, our team initially pursued different strategies in order to surpass the first baseline. One team member obtained a Kaggle score of 86% using an optimized random forest classifier and SimpleImputer (strategy=mean) prior to any data cleaning, including redundant variables such as id. While this approach generated a high score, the resulting training data set was immense and computationally demanding due to the high number of distinct categories converted into dummy variables. Alternatively, another team member applied an approach utilizing "cleaned data"; this ultimately resulted in lower scores on the public leaderboard, which made it difficult for us to agree on which variables should be removed or summarized.

We finally implemented the following:

  • We dropped the variable id. Given the description of the dataset, we theorize that the dataframe is well shuffled and the variable id should not contain any information for the prediction.
  • As state_prev_res is a more detailed variable of region_prev_res, we decided to keep the latter.
  • The variables ind_code_level1 and ind_code_level2 contain the same information. However, ind_code_level2 is encoded and appears to be more detailed than ind_code_level1, as the value-count lists above show. Therefore, we decided to keep the summarized variable ind_code_level1.
  • The variable det_hh_summ is a summarized version of the variable det_hh_fam_stat. We decided to drop det_hh_fam_stat and keep det_hh_summ.
  • We dropped the variable hs_college because it should presumably contain the same information as the variable education.
  • For the variables country_self, country_mother and country_father, we summarized entries as non-US or US.
  • We encoded the variable education with OrdinalEncoder given that it can be ordered in a logical way. The other qualitative variables were transformed into dummy variables.

These decisions are encoded in the function DataCleaning() which we then applied to both the test and the train data set.

In [ ]:
# create a function which can be applied on both dataframes
def DataCleaning(df):
  # remove variables
  df.drop(columns=['id', 'state_prev_res', 'ind_code_level2', 'det_hh_fam_stat', 'hs_college'], inplace=True)
  # summarize categories
  df["country_self"].replace("(?m)^(?!United-States).*", "Non-US", regex=True, inplace=True)
  df["country_father"].replace("(?m)^(?!United-States).*", "Non-US", regex=True, inplace=True)
  df["country_mother"].replace("(?m)^(?!United-States).*", "Non-US", regex=True, inplace=True)
  # encode variable education
  # categories must be a list with one entry (itself a list) per encoded feature
  ordenc = OrdinalEncoder(categories=[['Children', 'Less than 1st grade', '1st 2nd 3rd or 4th grade', '5th or 6th grade', '7th and 8th grade', '9th grade',
               '10th grade', '11th grade', '12th grade no diploma', 'High school graduate', 'Some college but no degree',
                'Associates degree-occup /vocational', 'Associates degree-academic program', 'Bachelors degree(BA AB BS)',
                'Masters degree(MA MS MEng MEd MSW MBA)', 'Doctorate degree(PhD EdD)', 'Prof school degree (MD DDS DVM LLB JD)']],
                handle_unknown="use_encoded_value", unknown_value=np.nan)
  df["education"] = ordenc.fit_transform(df.loc[:,['education']])
  # return dataframe
  return df
In [ ]:
# Apply DataCleaning() on both datasets
train_df = DataCleaning(train_df)
test_df = DataCleaning(test_df)

Additional Data Preparation: In order to make the data accessible for the different models, further modifications are needed. They can be summarized as follows:

  1. We appended the Kaggle test data set to our train data set. This must be done in order to apply the same modifications to both dataframes, so that the trained models can predict on the Kaggle test data set.
  2. We performed NA-Imputing on our data. After evaluation, we concluded that different imputing strategies lead to different performances (see Supplementary Notebook NAImputation for further information and conclusions). We finally applied SimpleImputer to impute the most frequent category for categorical variables, the mean for numeric and non-skewed variables, and the median for skewed variables.
  3. For categorical variables, we created dummy variables.
  4. We separated the Kaggle test data set from the train data set once again.
  5. Given that we have to train our final model on the complete train data set, we saved a copy of the training data set for the final model fitting.
  6. For the different models - especially for stacking - we subdivided the training data set into a real training data set and a test data set. We also created copies from those vectors/dataframes in order to reset before the different models.
In [ ]:
# 1. Add column for later separation. Append test_df to train_df.
train_df['train_test'] = 'train'
test_df['train_test'] = 'test'
df_all = pd.concat([train_df, test_df], ignore_index=True)  # DataFrame.append is deprecated
In [ ]:
# 1.1. Transform Categorical Variables
cols = list(df_all.select_dtypes(include=['object']).columns)
df_all[cols] = df_all[cols].astype('category')
In [ ]:
# 2.1. NA-Imputation on categorical data (most_frequent)
variables_cat = list(df_all.select_dtypes(include=['category']).columns)
imp = SimpleImputer(missing_values=np.NaN, strategy='most_frequent')
df_cat = pd.DataFrame(imp.fit_transform(df_all[variables_cat]))
df_cat.columns=df_all[variables_cat].columns
In [ ]:
# 2.2 NA-Imputation on numeric variables (mean for non-skewed, median for skewed)
variables_numeric = list(set(df_all.columns) - set(variables_cat))
variables_skewed = ["stock_dividends", "wage_per_hour", "capital_losses", "capital_gains"]
variables_nonskewed = list(set(variables_numeric) - set(variables_skewed))
# Imputation : non- skewed variables
imp = SimpleImputer(missing_values=np.NaN, strategy='mean')
df_nonskewed = pd.DataFrame(imp.fit_transform(df_all[variables_nonskewed]))
df_nonskewed.columns=df_all[variables_nonskewed].columns
# Imputation: skewed variables
imp = SimpleImputer(missing_values=np.NaN, strategy='median')
df_skewed = pd.DataFrame(imp.fit_transform(df_all[variables_skewed]))
df_skewed.columns=df_all[variables_skewed].columns
In [ ]:
# 2.3 Concat dataframes
df_all = pd.concat([df_skewed, df_nonskewed, df_cat], axis = 1)
In [ ]:
# 3. Create Dummy Variables: apply get_dummies()
df_all = pd.get_dummies(df_all, drop_first=True)
In [ ]:
# 4. Separate dataframes again
# train_df
train_df = df_all.loc[df_all.train_test_train==1]
train_df = train_df.drop("train_test_train", axis=1)
# Kaggle test dataset
X_kaggle = df_all.loc[df_all.train_test_train==0]
X_kaggle = X_kaggle.drop("train_test_train", axis=1)
X_kaggle = X_kaggle.drop("high_income", axis=1)
In [ ]:
# 5. Create a copy of train_df for later use in final model fitting
train_df_original = train_df.copy()
In [ ]:
# 6. Split train_df (for model comparison)
X_comp = train_df.drop("high_income", axis=1)
y_comp = train_df["high_income"]

X_train, X_test, y_train, y_test = train_test_split(X_comp, y_comp, test_size=0.1, shuffle=True, random_state=12)
In [ ]:
# 6.1 Create copies of X_train, X_test, y_train and y_test
X_train_original = X_train.copy()
X_test_original = X_test.copy()
y_train_original = y_train.copy()
y_test_original = y_test.copy()

[2] Model Selection¶

The primary objective of the problem is to predict whether an individual has high income or low income given the data set. Hence, it is a classification problem. To address this challenge, our team decided to start with the simplest models available and later use more complex models.

In all cases, each model's performance was tested with and without hyperparameter tuning. Models which achieved at least 80% accuracy on the test set were retained for stacking. All of the "best" performing models were then further optimized and evaluated to select the final model. Other models which fell below the 80% baseline we set are described in the supplementary notebooks.
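The stacking step itself appears later; as a minimal sketch of the scikit-learn StackingClassifier API we rely on (toy data and illustrative base learners, not our tuned models):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=12)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=12)

# Base learners whose out-of-fold predictions feed a final meta-learner
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=50, random_state=12)),
                ("lr", LogisticRegression(max_iter=1000))],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_tr, y_tr)
print(stack.score(X_te, y_te))
```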

2.1 Random Forest¶

In this subsection, a random forest classifier is fitted on our data. The initial accuracy for a prediction on the separated test data set was 85.6%. This initial score was improved by parameter tuning (values printed at the end of the code chunks). To ensure that the number of trees (n_estimators) is high enough, we cross-validated two different numbers of trees (theoretically, a value that is too high cannot cause overfitting). We initially approached the values in the parameter grid using the out-of-bag score, which is computationally faster. However, it is not directly comparable to the cross-validation score used for the other models, which is why we finally tuned the parameters using cross-validation.

Random forest on such a big dataset is computationally very demanding. By setting max_samples = 0.1, only 10% of the data is used to build each tree, which accelerates the process. Another approach could have been to further summarize the categorical variables so that fewer dummy variables are created.
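The out-of-bag screening mentioned above can be sketched as follows (synthetic data; the grid values are illustrative, not the tuned ones below):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=12)

# oob_score=True evaluates each tree on the samples left out of its
# bootstrap draw, so no separate cross-validation loop is needed
for max_features in ["sqrt", 10]:
    rfc = RandomForestClassifier(n_estimators=100, max_features=max_features,
                                 oob_score=True, random_state=12)
    rfc.fit(X, y)
    print(max_features, rfc.oob_score_)
```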

The final obtained best parameters are:

  • n_estimators = 250
  • max_features = 18
  • min_samples_leaf = 1
  • max_depth = None
In [ ]:
# Restore training and test data set
X_train = X_train_original.copy()
X_test = X_test_original.copy()
y_train = y_train_original.copy()
y_test = y_test_original.copy()
In [ ]:
# Model without Optimization
rfc = RandomForestClassifier(random_state=12)
rfc.fit(X_train, y_train)

print(f"Accuracy score with default parameters : {accuracy_score(y_true=y_test, y_pred=rfc.predict(X_test))}")
Accuracy score with default parameters : 0.854971616919978
In [ ]:
# Optimization
start_time = timeit.default_timer()

# Number of trees in random forest
n_estimators = [100, 250]
# Number of features to consider at every split
max_features = ['sqrt', 16, 18, 20, 22, 24, 26, 28, 30]
# number of samples required at each leaf node
min_samples_leaf = [1, 3, 5]
# max_depth
max_depth = [20, 35, None]

# Create the random grid
param_grid = {'n_estimators': n_estimators,
            'max_features': max_features,
               'min_samples_leaf': min_samples_leaf,
                  'max_depth': max_depth}

# Create the Classifier. For computational reasons, use max_samples=0.1
rfc = RandomForestClassifier(random_state=12, max_samples=0.1)

# Instantiate the grid search model
folds=StratifiedKFold(n_splits=5, shuffle=True, random_state=12)
grid_cv=GridSearchCV(estimator=rfc, param_grid=param_grid, scoring="accuracy", cv=folds, n_jobs=-1)

# Fit the grid search to the data
grid_cv.fit(X_train, y_train)

best_param = grid_cv.best_params_
best_score = grid_cv.best_score_
print("Best parameters:", best_param)
print("Best score:", best_score)

elapsed = timeit.default_timer() - start_time
elapsed_min = elapsed/60
print("Time [min]:", elapsed_min)

# Kaggle Score: 0.86098
Best parameters: {'max_depth': None, 'max_features': 30, 'min_samples_leaf': 1, 'n_estimators': 250}
Best score: 0.8605380097324472
Time [min]: 43.14529777925
In [ ]:
# Save mean CV score for model evaluation
rfc_meanCV = grid_cv.best_score_

# Save SD for model evaluation
rfc_stdCV = grid_cv.cv_results_['std_test_score'][grid_cv.best_index_]/np.sqrt(5)
In [ ]:
# Model with optimization
rfc_final = grid_cv.best_estimator_

rfc_final.fit(X_train, y_train)

rfc_test_score = accuracy_score(y_true=y_test, y_pred=rfc_final.predict(X_test))
print("Accuracy score on the separated test data set:", rfc_test_score)
Accuracy score on the separated test data set: 0.8699871818348288

2.2 Gradient Boosting¶

Similar to the random forest model, we first created a gradient boosting classifier model utilizing its default settings in order to have a baseline reference for the optimization steps.

In [ ]:
# Recall training and test data set
X_train = X_train_original.copy()
X_test = X_test_original.copy()
y_train = y_train_original.copy()
y_test = y_test_original.copy()
In [ ]:
# Baseline Model (without Optimization)
GBC = GradientBoostingClassifier(random_state=12)
GBC.fit(X_train,y_train)

print(f"Accuracy score with default parameters: {accuracy_score(y_true=y_test, y_pred=GBC.predict(X_test))}") 

#Compare to KAGGLE Import Score = 0.85109 (without optimization)

Optimization was conducted step-wise, ordered by each parameter's importance, i.e. its impact on the overall model:

  • Step 1 - Searches for the optimal learning_rate (which is the shrinkage parameter λ), as well as the n_estimators (number of trees B).
  • Step 2 - Involves max_depth (limits the number of nodes in the tree).
  • Step 3 - Optimizes the min_samples_split (minimum number of samples required to split an internal node).
  • Step 4 - Optimizes the min_samples_leaf (minimum number of samples required to be at a leaf node).
  • Step 5 - Optimizes the max_features (number of features to consider when looking for the best split).

For reference, the baseline, or default values for the parameters optimized are:

  • learning_rate = 0.1
  • n_estimators = 100
  • max_depth = 3
  • min_samples_split = 2
  • min_samples_leaf = 1
  • max_features = None
In [ ]:
# OPTIMIZATION STEP 1 (LEARNING RATE and NUMBER OF ESTIMATORS)

#Creating the GridSearchCV parameters
parameters = {
    "learning_rate": [0.01, 0.1, 0.2, 0.3],
    "n_estimators":[325,330,350,400]
    }
gb = GradientBoostingClassifier(random_state=12, subsample=0.1)

#Passing the scoring function in the GridSearchCV
folds = StratifiedKFold(n_splits=5, random_state=12, shuffle=True)
GBC1 = GridSearchCV(estimator=gb, param_grid=parameters, scoring='accuracy', cv=folds, n_jobs=-1)

#Fitting the Model
GBC1.fit(X_train, y_train)

best_param = GBC1.best_params_
best_score = GBC1.best_score_
print("Best parameters:", best_param)
print("Best score:", best_score)
Best parameters: {'learning_rate': 0.1, 'n_estimators': 325}
Best score: 0.8629593805860478
In [ ]:
# Model with Optimization STEP 1
GBC_fit1= GradientBoostingClassifier(learning_rate=0.1, n_estimators=330, random_state=12)
GBC_fit1.fit(X_train, y_train)

GBC_accuracy_fit1 = accuracy_score(y_true=y_test, y_pred=GBC_fit1.predict(X_test))
print(GBC_accuracy_fit1)

#Score with learning_rate: 0.1 with n_estimators: 250 = 0.8727
#Score with learning_rate: 0.1 with n_estimators: 300 = 0.8722 
#Score with learning_rate: 0.1 with n_estimators: 310 = 0.87328
#Score with learning_rate: 0.1 with n_estimators: 320 = 0.87369
#Score with learning_rate: 0.1 with n_estimators: 325 = 0.874199
#Score with learning_rate: 0.1 with n_estimators: 330 = 0.875114
0.8751144479033144
In [ ]:
# OPTIMIZATION STEP 2 (MAX DEPTH)

#Creating the GridSearchCV parameters
parameters_2 = {
     "max_depth":[2,3,4,5]
    }
gb1 = GradientBoostingClassifier(learning_rate=0.1, n_estimators=330, random_state=12, subsample=0.1)

#Passing the scoring function in the GridSearchCV
folds = StratifiedKFold(n_splits=5, random_state=12, shuffle=True)
GBC2 = GridSearchCV(estimator=gb1, param_grid=parameters_2, scoring='accuracy', cv=folds, n_jobs=-1)

#Fitting the Model
GBC2.fit(X_train, y_train)

best_param = GBC2.best_params_
best_score = GBC2.best_score_
print("Best parameters:", best_param)
print("Best score:", best_score)
Best parameters: {'max_depth': 3}
Best score: 0.8624710297873909
In [ ]:
# Model with Optimization Step 2
GBC_fit2 = GradientBoostingClassifier(learning_rate=0.1, n_estimators=330, max_depth=4, random_state=12)
GBC_fit2.fit(X_train, y_train)

GBC_accuracy_fit2 = accuracy_score(y_true=y_test, y_pred=GBC_fit2.predict(X_test))
print(GBC_accuracy_fit2)

#Score with learning_rate: 0.1 with n_estimators: 325, max_depth:2 = 0.868522
#Score with learning_rate: 0.1 with n_estimators: 325, max_depth:3 = 0.874199
#Score with learning_rate: 0.1 with n_estimators: 330, max_depth:3 = 0.875114
0.8743819813221022
In [ ]:
# OPTIMIZATION STEP 3 (MIN SAMPLES SPLIT)

#Creating the GridSearchCV parameters
parameters_3 = {
     "min_samples_split":[2,3,4,5]
    }
gb2 = GradientBoostingClassifier(learning_rate=0.1, n_estimators=330, max_depth=3, random_state=12, subsample=0.1)

#Passing the scoring function in the GridSearchCV
folds = StratifiedKFold(n_splits=5, random_state=12, shuffle=True)
GBC3 = GridSearchCV(estimator=gb2, param_grid=parameters_3, scoring='accuracy', cv=folds, n_jobs=-1)

#Fitting the Model
GBC3.fit(X_train, y_train)

best_param = GBC3.best_params_
best_score = GBC3.best_score_
print("Best parameters:", best_param)
print("Best score:", best_score)
Best parameters: {'min_samples_split': 3}
Best score: 0.8626745217067396
In [ ]:
# Model with Optimization Step 3
GBC_fit3 = GradientBoostingClassifier(learning_rate=0.1, n_estimators=330, max_depth=3, min_samples_split=2, random_state=12)
GBC_fit3.fit(X_train, y_train)

GBC_accuracy_fit3 = accuracy_score(y_true=y_test, y_pred=GBC_fit3.predict(X_test))
print(GBC_accuracy_fit3)

#Score with learning_rate: 0.1 with n_estimators: 325, max_depth:3, min_samples_split: 2 = 0.8733
#Score with learning_rate: 0.1 with n_estimators: 330, max_depth:3, min_samples_split: 2 = 0.875114
0.8751144479033144
In [ ]:
# OPTIMIZATION STEP 4 (MIN SAMPLES LEAF)

#Creating the GridSearchCV parameters
parameters_4 = {
     "min_samples_leaf":[1,3,5,7]
    }
gb3 = GradientBoostingClassifier(learning_rate=0.1, n_estimators=330, max_depth=3, min_samples_split=2, random_state=12, subsample=0.1)

#Passing the scoring function in the GridSearchCV
folds = StratifiedKFold(n_splits=5, random_state=12, shuffle=True)
GBC4 = GridSearchCV(estimator=gb3, param_grid=parameters_4, scoring='accuracy', cv=folds, n_jobs=-1)

#Fitting the Model
GBC4.fit(X_train, y_train)

best_param = GBC4.best_params_
best_score = GBC4.best_score_
print("Best parameters:", best_param)
print("Best score:", best_score)
Best parameters: {'min_samples_leaf': 5}
Best score: 0.8659504774782038
In [ ]:
# Model with Optimization Step 4
GBC_fit4 = GradientBoostingClassifier(learning_rate=0.1, n_estimators=330, max_depth=3, min_samples_split=2, min_samples_leaf=5, random_state=12)
GBC_fit4.fit(X_train, y_train)

GBC_accuracy_fit4 = accuracy_score(y_true=y_test, y_pred=GBC_fit4.predict(X_test))
print(GBC_accuracy_fit4)

#Score with learning_rate: 0.1 with n_estimators: 330, max_depth: 3, min_samples_split: 2, min_samples_leaf: 5 = 0.87218
#Score with learning_rate: 0.1 with n_estimators: 330, max_depth: 3, min_samples_split: 2, min_samples_leaf: 3 = 0.87347
#Score with learning_rate: 0.1 with n_estimators: 330, max_depth: 3, min_samples_split: 2, min_samples_leaf: 2 = 0.87438
#Score with learning_rate: 0.1 with n_estimators: 330, max_depth: 3, min_samples_split: 2, min_samples_leaf: 1 = 0.87511
0.8721845815784655
In [ ]:
# OPTIMIZATION STEP 5 (MAX FEATURES)

#Creating the GridSearchCV parameters
parameters_5 = {
     "max_features":['sqrt',20,50,None]
    }
gb4 = GradientBoostingClassifier(learning_rate=0.1, n_estimators=330, max_depth=3, min_samples_split=2, min_samples_leaf=5, random_state=12, subsample=0.1)

#Passing the scoring function in the GridSearchCV
folds = StratifiedKFold(n_splits=5, random_state=12, shuffle=True)
GBC5 = GridSearchCV(estimator=gb4, param_grid=parameters_5, scoring='accuracy', cv=folds, n_jobs=-1)

#Fitting the Model
GBC5.fit(X_train, y_train)

best_param = GBC5.best_params_
best_score = GBC5.best_score_
print("Best parameters:", best_param)
print("Best score:", best_score)
Best parameters: {'max_features': None}
Best score: 0.8659504774782038
In [ ]:
# Model with Optimization Step 5
GBC_fit5 = GradientBoostingClassifier(learning_rate=0.1, n_estimators=330, max_depth=3, min_samples_split=2, min_samples_leaf=5, max_features=None, random_state=12)
GBC_fit5.fit(X_train, y_train)

GBC_accuracy_fit5 = accuracy_score(y_true=y_test, y_pred=GBC_fit5.predict(X_test))
print(GBC_accuracy_fit5)

#Score with learning_rate: 0.1 with n_estimators: 330, max_depth: 3, min_samples_split: 2, min_samples_leaf: 5, max_features: None = 0.87218
#Score with learning_rate: 0.1 with n_estimators: 330, max_depth: 3, min_samples_split: 2, min_samples_leaf: 1, max_features: None = 0.87511
0.8721845815784655

As seen in the optimization performed above, certain steps demonstrated that the default values were the most appropriate for the parameters learning_rate, max_depth, min_samples_split, and max_features.

On the other hand, optimization for n_estimators and min_samples_leaf suggested 330 and 5 as the "best" values, respectively. Used in prediction, n_estimators = 330 resulted in a higher accuracy score, while min_samples_leaf = 5 yielded lower results compared to its default setting of 1. (See Model with Optimization Step 5 for the difference in scores.)

Thus, the final gradient boosting classification model parameter values selected were:

  • learning_rate = 0.1
  • n_estimators = 330
  • max_depth = 3
  • min_samples_split = 2
  • min_samples_leaf = 1
  • max_features = None
In [ ]:
# FINAL Model with Optimization Combined Steps
GBC_final = GradientBoostingClassifier(learning_rate=0.1, n_estimators=330, max_depth=3, min_samples_split=2, min_samples_leaf=1, max_features=None, random_state=12)
GBC_final.fit(X_train, y_train)

GBC_test_score = accuracy_score(y_true=y_test, y_pred=GBC_final.predict(X_test))
print("Accuracy score on the separated test data set:", GBC_test_score)

#GBC Final Model Accuracy = 0.8751144479033144
#KAGGLE Score = 0.85604 --> compared to UNOPTIMIZED Score = 0.85109
Accuracy score on the separated test data set: 0.8751144479033144
In [ ]:
#Calculating Mean CV and Standard Deviation for Model Evaluation

GBC_scores_final = cross_val_score(GBC_final, X=X_test, y=y_test, cv=5)
GBC_mean_final, GBC_std_final = np.mean(GBC_scores_final), np.std(GBC_scores_final)/np.sqrt(5)
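This mean-and-standard-error pattern recurs for every model below; a small helper (hypothetical name `cv_mean_se`) wraps it so cross-validation runs only once per model:

```python
import numpy as np
from sklearn.model_selection import cross_val_score

def cv_mean_se(model, X, y, cv=5):
    """Mean CV accuracy and the standard error of that mean, from one CV run."""
    scores = cross_val_score(model, X=X, y=y, cv=cv)
    return scores.mean(), scores.std() / np.sqrt(cv)
```

Calling e.g. `cv_mean_se(GBC_final, X_test, y_test)` would reproduce the pair computed above in a single pass.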

2.3 Linear Discriminant Analysis¶

The LDA classifier produces linear decision boundaries, known as hyperplanes: flat affine subspaces of dimension $p-1$. LDA fits a Gaussian density to each class, assuming that all classes share the same covariance matrix.
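Concretely, under this shared-covariance Gaussian assumption, LDA assigns an observation $x$ to the class $k$ with the largest linear discriminant score (standard textbook form, not taken from the notebook):

$$\delta_k(x) = x^{T}\Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^{T}\Sigma^{-1}\mu_k + \log \pi_k$$

where $\mu_k$ is the class mean, $\Sigma$ the common covariance matrix, and $\pi_k$ the class prior; the score is linear in $x$, which is what makes the decision boundaries hyperplanes.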

We attempted to apply regularization through the shrinkage parameter implemented in scikit-learn, where, according to the documentation, "shrinkage is a form of regularization used to improve the estimation of covariance matrices". However, the accuracy score was the same with and without regularization: 84.91%.
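For reference, shrinkage replaces the empirical covariance estimate with a convex combination of it and a scaled identity matrix (standard form; $\alpha \in [0,1]$ is the shrinkage parameter):

$$\hat{\Sigma}_{\text{shrunk}} = (1-\alpha)\,\hat{\Sigma} + \alpha\,\frac{\operatorname{tr}(\hat{\Sigma})}{p}\,I$$

so $\alpha = 0$ recovers the empirical covariance and $\alpha = 1$ a spherical estimate.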

In [ ]:
lda_pipe = Pipeline([("scaler", StandardScaler()), ("lda", LinearDiscriminantAnalysis(solver="lsqr"))])
In [ ]:
lda_pipe.fit(X_train, y_train)
Out[ ]:
Pipeline(steps=[('scaler', StandardScaler()),
                ('lda', LinearDiscriminantAnalysis(solver='lsqr'))])

GridSearchCV was used to find the optimal value for the shrinkage hyper-parameter.

In [ ]:
shrinkage = 10**np.linspace(-4,0, 100)
#use of StratifiedKFold
folds=StratifiedKFold(n_splits=5, shuffle=True, random_state=12)
grid_cv=GridSearchCV(estimator=lda_pipe, param_grid={"lda__shrinkage":shrinkage}, scoring="accuracy", cv=folds, n_jobs=-1)
In [ ]:
search_cv = grid_cv.fit(X_train,y_train)
In [ ]:
grid_cv.best_estimator_
Out[ ]:
Pipeline(steps=[('scaler', StandardScaler()),
                ('lda',
                 LinearDiscriminantAnalysis(shrinkage=0.0004430621457583882,
                                            solver='lsqr'))])
In [ ]:
print(f"LDA test accuracy with default parameters = {accuracy_score(y_true=y_test, y_pred=lda_pipe.predict(X_test))}")
LDA test accuracy with default parameters = 0.8491118842702802
In [ ]:
print(f"LDA test accuracy with parameter tuning: {accuracy_score(y_test, grid_cv.best_estimator_.predict(X_test))}")
LDA test accuracy with parameter tuning: 0.8491118842702802

As shown above, LDA performs best with default parameters; tuning the shrinkage does not beat this score.

For a fair comparison with other models, CV is performed on LDA with default parameters to obtain the mean and the standard deviation.

In [ ]:
LDA_mean_final, LDA_std_final = np.mean(cross_val_score(lda_pipe, X=X_test, y=y_test, cv=5)), np.std(cross_val_score(lda_pipe, X=X_test, y=y_test, cv=5))/np.sqrt(5)
print(LDA_mean_final, LDA_std_final)
0.8459978417434959 0.003905247462099063

2.4 Logistic Regression¶

Continuing with linear classifiers, logistic regression (logreg) was trained on our data. The logistic model assumes that $Y|X=x_{i}$ follows a Bernoulli distribution, in contrast to LDA, which models the class-conditional distribution of the predictors as Gaussian.

To reduce the risk of overfitting and reduce model complexity, logreg was explored with both ridge and lasso regularization.
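The two penalized objectives being compared are, in standard textbook form (scikit-learn parameterizes the penalty strength inversely, via $C \approx 1/\lambda$):

$$p(x) = \frac{1}{1+e^{-(\beta_0 + x^{T}\beta)}}, \qquad \min_{\beta}\; -\ell(\beta) + \lambda \lVert\beta\rVert_2^2 \;\;\text{(ridge)}, \qquad \min_{\beta}\; -\ell(\beta) + \lambda \lVert\beta\rVert_1 \;\;\text{(lasso)}$$

where $\ell(\beta)$ is the log-likelihood; the $\ell_1$ penalty can drive coefficients exactly to zero, which is why lasso also performs variable selection.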

In [ ]:
# ======== Plain LogReg ========
logReg = Pipeline([("scaler", StandardScaler()), ("log_reg", LogisticRegression(solver="saga", random_state=12, n_jobs=4, tol=1e-2))])
logReg.fit(X_train, y_train)
print(f'LogReg test accuracy with default parameters: {accuracy_score(y_test, logReg.predict(X_test))}')
LogReg test accuracy with default parameters: 0.8659586156381615
In [ ]:
Log_mean_CV, Log_std_CV = np.mean(cross_val_score(logReg, X=X_test, y=y_test, cv=5)), np.std(cross_val_score(logReg, X=X_test, y=y_test, cv=5))/np.sqrt(5)
print(Log_mean_CV, Log_std_CV)
0.8507592438059044 0.0041327898644139185

Ridge regularization

In [ ]:
# ======== Ridge regularization ========
log_reg_ridge = Pipeline([("scaler", StandardScaler()), ("log_reg", LogisticRegression(penalty="l2", solver="saga", C=1, random_state=12, n_jobs=4, tol=1e-2))])
log_reg_ridge.fit(X_train, y_train)
Out[ ]:
Pipeline(steps=[('scaler', StandardScaler()),
                ('log_reg',
                 LogisticRegression(C=1, n_jobs=4, random_state=12,
                                    solver='saga', tol=0.01))])
In [ ]:
C =10**np.linspace(-4,0, 100)
folds=StratifiedKFold(n_splits=5, shuffle=True, random_state=12)
In [ ]:
grid_cv=GridSearchCV(estimator=log_reg_ridge, param_grid={"log_reg__C":C}, scoring="accuracy", cv=folds, n_jobs=4)
search_cv=grid_cv.fit(X_train,y_train)
In [ ]:
grid_cv.best_estimator_
Out[ ]:
Pipeline(steps=[('scaler', StandardScaler()),
                ('log_reg',
                 LogisticRegression(C=0.07390722033525783, n_jobs=4,
                                    random_state=12, solver='saga',
                                    tol=0.01))])


In [ ]:
print(f"LogReg Ridge test accuracy after hyperparameter tuning: {accuracy_score(y_test, grid_cv.best_estimator_.predict(X_test))}")
LogReg Ridge test accuracy after hyperparameter tuning: 0.8659586156381615
In [ ]:
LogRidge_mean_CV, LogRidge_std_CV = np.mean(cross_val_score(grid_cv.best_estimator_, X=X_test, y=y_test, cv=5)), np.std(cross_val_score(grid_cv.best_estimator_, X=X_test, y=y_test, cv=5))/np.sqrt(5)
#print(LogRidge_mean_CV, LogRidge_std_CV)

Lasso regularization

In [ ]:
# ======== Lasso regularization ========
log_reg_lasso = Pipeline([("scaler", StandardScaler()), ("log_reg", LogisticRegression(penalty="l1", solver="saga", C=1, random_state=12, n_jobs=4, tol=1e-2))])
log_reg_lasso.fit(X_train, y_train)
Out[ ]:
Pipeline(steps=[('scaler', StandardScaler()),
                ('log_reg',
                 LogisticRegression(C=1, n_jobs=4, penalty='l1',
                                    random_state=12, solver='saga',
                                    tol=0.01))])
In [ ]:
grid_cv=GridSearchCV(estimator=log_reg_lasso, param_grid={"log_reg__C":C}, scoring="accuracy", cv=folds, n_jobs=4)
search_cv=grid_cv.fit(X_train,y_train)
In [ ]:
print(f"LogReg Lasso test accuracy after hyperparameter tuning: {accuracy_score(y_test, grid_cv.best_estimator_.predict(X_test))}")
LogReg Lasso test accuracy after hyperparameter tuning: 0.8654092657022523
In [ ]:
LogLasso_mean_CV, LogLasso_std_CV = np.mean(cross_val_score(grid_cv.best_estimator_, X=X_test, y=y_test, cv=5)), np.std(cross_val_score(grid_cv.best_estimator_, X=X_test, y=y_test, cv=5))/np.sqrt(5)
#print(LogLasso_mean_CV, LogLasso_std_CV)

Combining Principal Component Analysis and Logistic Regression

Principal Component Analysis (PCA) is an unsupervised machine learning method that summarizes the data set in a lower-dimensional predictor space spanned by the principal components. PCA is handy for detecting underlying trends in the data and, from a feature-engineering perspective, for extracting features from highly correlated predictors. Thus, exploring logreg performance in an ensemble with PCA seemed appealing.
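A minimal sketch of how `PCA(n_components=0.95)` picks its components: it keeps the smallest number of components whose cumulative explained variance reaches 95%. Iris is a stand-in data set here, not the census data:

```python
import numpy as np
from sklearn.datasets import load_iris  # illustrative stand-in data
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Scale first, as in the pipelines below, so no feature dominates the variance
X_demo = StandardScaler().fit_transform(load_iris().data)

# A float n_components is interpreted as a target fraction of explained variance
pca = PCA(n_components=0.95)
Z = pca.fit_transform(X_demo)
print(Z.shape[1], "components explain",
      round(pca.explained_variance_ratio_.sum(), 3), "of the variance")
```

Passing `.90`, `.95`, or `.99` (as in the grid below) therefore tunes how aggressively the predictor space is compressed.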

In [ ]:
# ======== PCA + LogReg Ridge ========
pca_logReg_ridge = Pipeline([("scaler", StandardScaler()), ("pca",PCA(n_components=.95)) ,("log_reg", LogisticRegression(penalty="l2", solver="saga",
                                                                                      random_state=12, tol=1e-2))])
pca_logReg_ridge.fit(X_train, y_train)
Out[ ]:
Pipeline(steps=[('scaler', StandardScaler()), ('pca', PCA(n_components=0.95)),
                ('log_reg',
                 LogisticRegression(random_state=12, solver='saga', tol=0.01))])
In [ ]:
C =10**np.linspace(-4,0, 100)
p_comp = [.90, .95, .99]
folds=StratifiedKFold(n_splits=5, shuffle=True, random_state=12)
In [ ]:
grid_cv=GridSearchCV(estimator=pca_logReg_ridge, param_grid={"pca__n_components":p_comp ,"log_reg__C":C, }, scoring="accuracy", cv=folds, n_jobs=-1)
search_cv = grid_cv.fit(X_train, y_train)
In [ ]:
grid_cv.best_estimator_
Out[ ]:
Pipeline(steps=[('scaler', StandardScaler()), ('pca', PCA(n_components=0.99)),
                ('log_reg',
                 LogisticRegression(C=0.20565123083486536, random_state=12,
                                    solver='saga', tol=0.01))])


Comparing the accuracy on test set between default and tuned hyperparameters:

In [ ]:
print(f"PCA + LogReg accuracy on test set with default parameters: {accuracy_score(y_test, pca_logReg_ridge.predict(X_test))}")
PCA + LogReg accuracy on test set with default parameters: 0.8529573338216444
In [ ]:
print(f"PCA + LogReg Ridge accuracy on test set after hyperparameter tuning: {accuracy_score(y_test, grid_cv.best_estimator_.predict(X_test))}")
PCA + LogReg Ridge accuracy on test set after hyperparameter tuning: 0.8586339498260391
In [ ]:
LogPCA_Ridge_mean_CV, LogPCA_Ridge_std_CV = np.mean(cross_val_score(grid_cv.best_estimator_, X=X_test, y=y_test, cv=5)), np.std(cross_val_score(grid_cv.best_estimator_, X=X_test, y=y_test, cv=5))/np.sqrt(5)
#print(LogPCA_Ridge_mean_CV, LogPCA_Ridge_std_CV)
In [ ]:
# ======== PCA + LogReg Lasso ========
pca_logReg_lasso = Pipeline([("scaler", StandardScaler()), ("pca",PCA(n_components=.95)) ,("log_reg", LogisticRegression(penalty="l1", solver="saga",
                                                                                      random_state=12, tol=1e-2))])
pca_logReg_lasso.fit(X_train, y_train)
Out[ ]:
Pipeline(steps=[('scaler', StandardScaler()), ('pca', PCA(n_components=0.95)),
                ('log_reg',
                 LogisticRegression(penalty='l1', random_state=12,
                                    solver='saga', tol=0.01))])
In [ ]:
grid_cv=GridSearchCV(estimator=pca_logReg_lasso, param_grid={"pca__n_components":p_comp ,"log_reg__C":C, }, scoring="accuracy", cv=folds)
search_cv =grid_cv.fit(X_train, y_train)
In [ ]:
grid_cv.best_estimator_
Out[ ]:
Pipeline(steps=[('scaler', StandardScaler()), ('pca', PCA(n_components=0.99)),
                ('log_reg',
                 LogisticRegression(C=0.9111627561154896, penalty='l1',
                                    random_state=12, solver='saga',
                                    tol=0.01))])


Comparing the accuracy on test set between default and tuned hyperparameters:

In [ ]:
print(f"PCA + LogReg Lasso test accuracy before hyperparameter tuning: {accuracy_score(y_test, pca_logReg_lasso.predict(X_test))}")
PCA + LogReg Lasso test accuracy before hyperparameter tuning: 0.8527742171763413
In [ ]:
print(f"PCA + LogReg Lasso accuracy on test set after hyperparameter tuning: {accuracy_score(y_test, grid_cv.best_estimator_.predict(X_test))}")
PCA + LogReg Lasso accuracy on test set after hyperparameter tuning: 0.8586339498260391
In [ ]:
LogPCA_Lasso_mean_CV, LogPCA_Lasso_std_CV = np.mean(cross_val_score(grid_cv.best_estimator_, X=X_test, y=y_test, cv=5)), np.std(cross_val_score(grid_cv.best_estimator_, X=X_test, y=y_test, cv=5))/np.sqrt(5)
#print(LogPCA_Lasso_mean_CV, LogPCA_Lasso_std_CV)

2.4.1 Visualization of Model Performance¶

The aim of this section is to provide an intuitive comparison of the performance of the logreg-based models built in this section.

Model hyperparameters were tuned using GridSearchCV. The best estimator was retrieved and cross-validation was performed on the test set for a fair comparison of the mean score and the standard deviation of the mean score.

In [2]:
accuracy_test_df = pd.DataFrame({"Accuracy": [0.8507, 0.8492, 0.8443, 0.8441, 0.8447],
                                "std": [0.0041, 0.0041, 0.0040, 0.0042, 0.0041]},
                               index=["LogReg","LogRidge", "LogLasso", "PCA_LogRidge", "PCA_LogLasso"])
In [3]:
accuracy_test = px.scatter(data_frame=accuracy_test_df, x=accuracy_test_df.index, y="Accuracy",
            symbol=accuracy_test_df.index, template="presentation", error_y="std",
            color=accuracy_test_df.index, width=700, height=400,
            labels={"index": "Estimators"}, title="Performance of Different Estimators on Test Set")
accuracy_test.update_layout(font=dict(size=12), legend_title="Estimators")

On one hand, some models did not perform better after hyperparameter tuning; LogLasso is an extreme case.

On the other hand, some models' accuracy scores improved after hyperparameter tuning, as was the case for both PCA-LogReg models. Nonetheless, their performance was still worse than plain LogReg's.

As discussed during practical classes, ridge regularization aids in improving prediction accuracy while lasso aids in variable selection.

The decision to retain a model was based on the theoretical aims of improving prediction and reducing model complexity. As the graph above shows, the LogReg and LogRidge models are similar in both their scores and their imprecision; the standard deviations for the two models also overlap. We therefore consider LogRidge the "best model" of this sub-section.

In [ ]:
#### Computing mean cv score and std for the models of this section which will be included on stacking ####
# ==== LDA ====
LDA_mean_final, LDA_std_final = np.mean(cross_val_score(lda_pipe, X=X_test, y=y_test, cv=5)), np.std(cross_val_score(lda_pipe, X=X_test, y=y_test, cv=5))/np.sqrt(5)

# ==== Logistic ====
log_ridge_final = Pipeline([("scaler", StandardScaler()), ("log_reg", LogisticRegression(penalty="l2", solver="saga", C=0.07390722033525783, random_state=12, tol=1e-2))])

LogReg_mean_final, LogReg_std_final = np.mean(cross_val_score(log_ridge_final, X=X_test, y=y_test, cv=5)), np.std(cross_val_score(log_ridge_final, X=X_test, y=y_test, cv=5))/np.sqrt(5)

LDA_test_score, LogRidge_test_score = 0.8491118842702802, 0.8659586156381615

2.5 Stacking¶

This final model is an ensemble of the optimized models presented previously. We include all four: random forest, gradient boosting, linear discriminant analysis, and logistic regression, each with the parameters obtained from optimization. As the final estimator, we use logistic regression. The model performs well but was not able to outperform the other methods on the separated test data set. We also uploaded a prediction based on the Kaggle test data set, which performed slightly worse than the optimized random forest model.

In [ ]:
# Define estimators used for stacking
estimators = [
('rfc', RandomForestClassifier(max_depth=None, max_features=18, min_samples_leaf = 1,
                                  n_estimators=250, random_state=12, n_jobs=-1)),
("lda", make_pipeline(StandardScaler(),
                          LinearDiscriminantAnalysis(solver="lsqr"))),
('LogRidge', make_pipeline(StandardScaler(),
                          LogisticRegression(penalty="l2", solver="saga",
                                             C=0.07390722033525783, random_state=12,
                                             tol=1e-2))),
('gbc', GradientBoostingClassifier(learning_rate=0.1, n_estimators=330, max_depth=3,
                                      min_samples_split=2, min_samples_leaf=1, max_features=None, random_state=12))]

# Define model ensemble
StackingClas = StackingClassifier(estimators=estimators, stack_method="predict_proba",
                                  final_estimator=LogisticRegression(random_state=12),
                                  n_jobs=-1)

# Train model
StackingClas.fit(X_train, y_train)

# Evaluate model on separated test data set
stacking_test_score = accuracy_score(y_true=y_test, y_pred=StackingClas.predict(X_test))
print("Accuracy score on separated test data set", stacking_test_score)
Accuracy score on separated test data set 0.8723676982237686

[3] Model Evaluation¶

Here, we compare the models developed in the previous sections to determine the final model for submission.

3.1 Accuracy Table and Test Scores¶

Since we used cross-validation for the single models, the cross-validation mean score is comparable across all models. Results are summarized in the table below.

In [ ]:
# Accuracy score and SD for 10% of the data 
model_accuracy_df = pd.DataFrame({"Accuracy": [rfc_meanCV, GBC_mean_final, LDA_mean_final, LogReg_mean_final],
                               "SD": [rfc_stdCV, GBC_std_final, LDA_std_final, LogReg_std_final]},
                               index=["RFC", "GBC", "LDA","LogRidge"])
model_accuracy_df
Out[ ]:
Accuracy SD
RFC 0.860538 0.001278
GBC 0.856434 0.004139
LDA 0.845998 0.003905
LogRidge 0.849294 0.004188

We initially sliced 10% of the training data for the evaluation of the stacked model. Predictions of all models in the previous section were conducted on this test data set; scores can be found in the following table.

In [ ]:
# Test scores from tested models

test_score_df = pd.DataFrame({"Score": [rfc_test_score, GBC_test_score, LDA_test_score, LogRidge_test_score, stacking_test_score]},
                             index=["RFC", "GBC", "LDA", "LogRidge", "Stacking"])

test_score_df
Out[ ]:
Score
RFC 0.869987
GBC 0.875114
LDA 0.849112
LogRidge 0.865959
Stacking 0.872368

Preliminary conclusion:

The random forest classifier has the best accuracy and standard deviation scores under cross-validation; however, on the separated test data set, GBC and stacking perform better.

3.2 Confusion Matrices¶

To further support our decision in determining the best model, we created confusion matrices of our top models, random forest classifier and gradient boosting classifier.

The matrices below show that the classification accuracy of the random forest was notably higher, at ~99.2%, compared to the gradient booster's ~87.5%. Type 1 and Type 2 errors also occurred less often with the RFC than with the GBC. Sensitivity and specificity of the RFC were likewise superior, at ~99.2% and ~98.9% respectively, while the GBC's sensitivity was ~90.7% and its specificity ~80.3%.
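The sensitivity and specificity figures quoted above follow directly from a binary confusion matrix. A minimal sketch using scikit-learn's `confusion_matrix` (note: the `confusion_matrix` called in the cells below is a custom plotting helper defined elsewhere in the notebook, not this function):

```python
from sklearn.metrics import confusion_matrix

def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP/(TP+FN), specificity = TN/(TN+FP) for binary labels."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return tp / (tp + fn), tn / (tn + fp)

# Toy example: 2 true positives, 1 true negative, 1 false positive
print(sensitivity_specificity([0, 0, 1, 1], [0, 1, 1, 1]))  # → (1.0, 0.5)
```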

In [ ]:
# Random Forest Classifier
confusion_matrix(rfc_final, X_train, y_train, X_test, y_test, classes=["0", "1"], percent=False);
In [ ]:
# Gradient Boosting Classifier
confusion_matrix(GBC_final, X_train, y_train, X_test, y_test, classes=["0", "1"], percent=False);

[4] Conclusion¶

Based on the accuracy and standard deviation scores as well as the confusion matrices; we conclude that the random forest classifier is our best predictive model.

While comparing test scores across the various models did show that the gradient boosting classifier slightly outperforms the other methods, the RFC test score is nearly equal to the GBC's. Apart from the model evaluation conducted above, Kaggle submissions also showed that the optimized RFC outperforms the optimized GBC, with scores of 0.86098 and 0.85604 respectively. The stacked model also performed worse than the single optimized RFC model, which is why we decided not to explore that approach in more detail.

While we are not surprised by these results, since RFCs are robust models (often requiring little hyperparameter tuning and less prone to overfitting), we recognize that different data-cleaning and NA-imputation approaches impact model performance, however slightly. Other imputation methods (summarized in the Supplementary Notebooks) or further summarization of categorical variables might have led to a different final model had we investigated the possibilities even further. Alternatively, other hyperparameter-tuning methods could have been applied to the gradient boosting model, which is at times known to outperform random forests.

Nevertheless, due to limitations in time and computational power of our personal computers, we chose to move forward with the RFC as our final model.

4.1 Final Model Fitting¶

Previously, we used 10% of the training data during optimization of the RFC for faster computation. Having determined that the RFC outperforms the other classifiers, we used the whole training data set for the final parameter optimization. Of note, grid parameters were not modified from the initial RFC optimization. The steps for the final model fitting are reviewed below.

  • We initially recall the complete training data set (with data cleaning and imputation steps already implemented).
  • The random forest classifier is then optimized utilizing the grid parameters previously applied based on the whole training data instead of a small subset.
  • Best parameters are applied to the RFC and fitted on the entire training data.
  • The optimized model is finally used to predict on Kaggle test data and exported for final submission.
  • On the public leaderboard, we yielded a score of 0.85659 using this final model.
In [ ]:
# Reload complete train data set
train_df = train_df_original.copy()
# Create X and y
X = train_df.drop("high_income", axis=1)
y = train_df["high_income"]
In [ ]:
# Final Random Forest Classifier Optimization
start_time = timeit.default_timer()

# Number of trees in random forest
n_estimators = [100, 250]
# Number of features to consider at every split
max_features = ['sqrt', 16, 18, 20, 22, 24, 26, 28, 30]
# number of samples required at each leaf node
min_samples_leaf = [1, 3, 5]
# max_depth
max_depth = [20, 35, None]

# Create the random grid
p_grid = {'n_estimators': n_estimators,
            'max_features': max_features,
               'min_samples_leaf': min_samples_leaf,
                  'max_depth': max_depth}

# Create the Classifier
rfc_final_op = RandomForestClassifier(random_state=12)

# Instantiate the grid search model
folds=StratifiedKFold(n_splits=5, shuffle=True, random_state=12)
grid_cv_final=GridSearchCV(estimator=rfc_final_op, param_grid=p_grid, scoring="accuracy", cv=folds, n_jobs=-1)

# Fit the grid search to the data
grid_cv_final.fit(X, y)

best_param = grid_cv_final.best_params_
best_score = grid_cv_final.best_score_
print("Best parameters:", best_param)
print("Best score:", best_score)

elapsed = timeit.default_timer() - start_time
elapsed_min = elapsed/60
print("Time [min]:", elapsed_min)
Best parameters: {'max_depth': None, 'max_features': 28, 'min_samples_leaf': 3, 'n_estimators': 250}
Best score: 0.8616844424175281
Time [min]: 138.59173749463335
In [ ]:
# Define Model
#RandomForestClassifier(max_depth=None , max_features=28, min_samples_leaf=3, n_estimators=250 , random_state=12, n_jobs=-1)
rfc_kaggle = grid_cv_final.best_estimator_
rfc_kaggle.fit(X,y)

# Random Forest Classifier Prediction on Kaggle Test Data
y_pred = rfc_kaggle.predict(X_kaggle)

Exporting .csv file for Kaggle Submission¶

In [ ]:
# Create csv
index = np.arange(1, len(y_pred) + 1)
final_output = {'id': index, 'high_income': y_pred}
final_output = pd.DataFrame(final_output)
final_output["id"] = final_output["id"].astype('int64')
final_output["high_income"] = final_output["high_income"].astype('int64')
final_output.to_csv('final_output.csv', index=False, header=["id", "high_income"])

Final Score¶

10th position with an $accuracy = 0.85687$